INTRODUCTION:


Type 1 Diabetes (T1D) is associated with increased risk of T1D-Nephropathy (T1DN) and is usually accompanied by
other diabetes-related complications such as retinopathy, neuropathy, blood pressure elevation, and high risk of
cardiovascular morbidity and mortality. Sixteen million people in the US have diabetes, with 800,000 new cases
diagnosed each year. Diabetic complications affect most diabetic patients. Diabetes occurs in men, women, children and
the elderly. African, Hispanic, Native and Asian Americans are particularly susceptible to its most severe complications.
An estimated 20% to 40% of T1D patients will develop diabetic nephropathy, clinically first evidenced by
microalbuminuria, during their lifetime. If untreated, nearly all T1D patients experiencing microalbuminuria will progress to
overt nephropathy, evidenced by macroalbuminuria, and culminating in T1D-End Stage Renal Disease (T1D-ESRD).
Improved prediction of risk for developing diabetes and diabetic complications among active duty members of the military,
their families and retired military personnel will potentially allow focused preventative treatment of at-risk individuals,
providing significant healthcare savings and improved patient well-being.

BODY: 

Our  first  quarterly  scientific  progress  report  for  the  third  year  of  our  project  (08/27/09  -  11/30/09) 
detailed  the  following  steps  forward  in  reaching  the  aims  of  our  study. 

In previous Quarterly Reports we addressed an expanded focus of the project to include proteomics research.
This aspect of the expanded project is designed to quantify proteins in blood that can be linked to onset of
Type 1 Diabetes (T1D). The goal is to identify diagnostic biomarkers that can be used alone or along with
DNA genotyping to gauge risk for developing T1D and diabetes complications.

Rationale: The Diabetes Prevention Trial of Type 1 (DPT-1) observed that individuals with 2 or more
autoantibodies exhibited an overall 68% 5-year risk while those with 3 autoantibodies approached a near
absolute risk for developing T1D (Verge et al., 1996). Results from subsequent studies have shown that
multiple autoantibodies confer cumulative risk ranging from 75% to nearly 90% during 5-10 year prospective
follow up of first degree relatives (Pietropaolo and Becker, 2001). While screening for the presence of 3
autoantibodies has high predictive value, a substantial minority (i.e., roughly 20% of first degree relatives who
convert to T1D) exhibit fewer than 3 autoantibodies (Pihoker et al., 2005). Moreover, autoantibodies may also
appear successively during disease progression and reflect ongoing autoimmune destruction of insulin
producing pancreatic beta cells.

Autoantibodies provide a good indication of risk for developing T1D. However, the test does not allow staging
of subjects by when in the next decade they will develop diabetes; it does not correlate with severity of the initial
symptoms of diabetes, nor does it correlate with the patient's ability to control their diabetes. The presence of
multiple autoantibodies is an indication that autoimmune destruction of insulin producing cells is already
occurring (i.e., autoimmune disease is already present). Studies of subjects who are positive for multiple
autoantibodies but remain diabetes free (i.e., false positive) indicate that for 2 autoantibodies roughly half the
subjects will remain free of diabetes and for 3 autoantibodies 10% to 25% will remain diabetes free. Moreover,
screening for autoantibodies has a significant false negative rate. Different studies have estimated
that 7% to 19% of new onset T1D patients are autoantibody negative (Wang et al., 2007). A critical goal for
prevention is to suppress destruction of beta cells as early as possible. For these reasons discovery of new
biomarkers that can be detected prior to the appearance of autoantibodies (or that provide a chronologically
accurate prediction of when T1D will develop) would be of great value.

T1D is among the most common chronic diseases of childhood with prevalence and incidence estimated in
children less than 20 years of age at 192/100,000 and 12.2/100,000, respectively (Jacobson et al., 1997). The
overall prevalence of the disease among siblings of T1D probands is increased by 30-fold while among HLA
identical siblings prevalence increases by more than 80-fold. Cohorts used in the T1D trials are typically
recruited from among siblings of T1D probands and exclusion/inclusion criteria make use of whether 2 or more
autoantibodies are present along with the subject's HLA genotype. Because the presence of multiple
autoantibodies is used as a primary criterion for inclusion, these patients are at a stage where they already present
autoimmune disease. This imposes increased difficulty on development of successful prevention strategies.
Of course, only 15% of new T1D cases occur among first-degree relatives. In other words, for 85% of those
who will present T1D there is currently no strategy for predicting risk.
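As a rough worked example of the figures quoted above, the sketch below treats each fold increase as a simple multiplier on the general-population prevalence. That multiplication and the variable names are illustrative assumptions, not a calculation taken from the cited studies.

```python
# Illustrative arithmetic only; every input is a figure quoted in the text above.
PER = 100_000
prevalence_general = 192      # cases per 100,000 children under 20 years of age
sibling_fold = 30             # fold increase among siblings of T1D probands
hla_identical_fold = 80       # lower bound ("more than 80-fold")

# Assumption: a fold increase acts as a simple multiplier on prevalence.
prevalence_siblings = prevalence_general * sibling_fold              # per 100,000
prevalence_hla_identical = prevalence_general * hla_identical_fold   # per 100,000 (lower bound)

share_first_degree = 15       # percent of new T1D cases among first-degree relatives
share_without_strategy = 100 - share_first_degree

print(f"Siblings: {prevalence_siblings:,} per {PER:,}")
print(f"HLA-identical siblings: more than {prevalence_hla_identical:,} per {PER:,}")
print(f"New cases outside first-degree relatives: {share_without_strategy}%")
```

Under these assumptions the sibling figure corresponds to several percent of siblings, which is why sibling cohorts are attractive for trials even though they capture only a minority of new cases.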

A  test  that  would  enable  improved  risk  estimation  would  find  immediate  use  in  prediction  and  prevention  trials. 
An  issue  to  consider  is  that  peak  incidence  of  disease  occurs  at  10  to  14  years  of  age  and  therefore  the  most 
important  trials  will  involve  recruitment  of  children.  The  false  positive  rate  associated  with  the  autoantibody  test 
makes  it  an  unsatisfactory  tool  to  use  when  screening  young  subjects  for  inclusion.  This  would  be  especially 
true  in  the  event  of  a  trial  designed  to  test  a  preventative  treatment. 

What population would ideally be used in a large clinical trial setting? In the event that the project identifies a
set of strongly predictive biomarkers an appropriate next step would be to approach TrialNet (see
http://www.diabetestrialnet.org). The TrialNet organization is a multicenter study with the goal of identifying
subjects for T1D prevention and intervention trials. Children's Hospital of Pittsburgh (CHP) is already acting as
a clinical center for the TrialNet natural history study (Mahon et al., 2009). TrialNet has currently assembled a
large cohort of family members (N=12,636) of T1D index cases of which greater than 300 are being monitored
in a prospective study for onset of T1D. Dr. Trucco is a member of the TrialNet Steering Committee and is
also the Chair of the TrialNet Scientific Review Committee.

Specific Aims: The specific aim for the expanded project is to investigate proteins and analytes present in
blood in order to determine if changes in abundance as well as changes in post-translational modifications are
associated with increased risk for developing T1D. In a pilot experiment we will use serum collected from a
cohort of T1D cases and non-T1D healthy controls to validate that the proteomics approach identifies changes
in protein or analyte abundance associated with disease. This will be followed by an analysis of serum
collected from subjects who converted to T1D. We will choose a set of serum samples collected at multiple
time points from T1D cases. Blinded samples will be tested on aptamer arrays, followed by quality control analysis of
the resulting data. Following quality control analysis the samples will be unblinded. Data will be analyzed with
the goal of identifying the optimal set of serum biomarkers exhibiting sensitivity to T1D risk.

In the latter stages of the project we will select a second set of serum samples from T1D cases that can be used to
validate markers discovered during Aim 1 of the project. Blinded samples will be tested and will be unblinded
following quality control analysis. In this case only the biomarkers identified initially as being sensitive for
T1D risk need to be analyzed. The final stage will be to analyze samples collected from low-risk participants.
We will select a set of serum samples from low-risk individuals recruited in the CHP longitudinal cohort. These
samples have been collected at multiple time points from participants who have remained non-T1D at the
study's endpoint. Blinded samples will be used and following quality control steps will be unblinded.

Progress Report: We are planning an experiment in which we will use serum collected from thirty subjects with Type 1
Diabetes (T1D) and from thirty healthy (i.e., non-diabetic) subjects recruited from Italy (Table 1). The serum samples
were originally collected by the Non Insulin Requiring Autoimmune Diabetes (NIRAD) Study and are being stored
in the AOB Data/Serum Bank. Our experimental design is to work with 50 µl of serum from each de-identified
subject. The material received will be used to examine the abundance of circulating proteins and analytes
present in blood with the goal of identifying proteins that can be used to diagnose risk of developing T1D.
These samples are already in existence but are no longer needed by the original study.

Table 1. T1D Case and non-T1D Healthy Control Samples.

          N    Material   Source
Case      30   Serum      AOB Data/Serum Bank
Control   30   Serum      AOB Data/Serum Bank

12.  Statement  of  Plans  for  the  Upcoming  Research  Period 

Goal  1.  Prepare  serum  samples  for  proteomic  analysis  on  aptamer  arrays. 

Goal  2.  Initiate  proteomic  analysis  of  serum  collected  from  N=30  T1D  cases  and  N=30  non-T1D  controls.

Literature  Cited: 
In our second quarterly scientific progress report (12/01/09 - 02/28/10), we presented the following
data:

During the previous research quarter our efforts focused upon two projects not mentioned in our prior
quarterly reports. They were to (1) finalize a manuscript for publication and (2) follow up on the
results obtained during the current DOD funded project by beginning the application process for
extramural funding to support additional studies into the genetics of inherited diabetes. This quarterly
report will summarize our effort to accomplish these two goals.

Completion  of  Goal  1.  Finalize  a  manuscript  describing  the  development  of  statistical  tools  for 
identifying  genomic  variants  (i.e.,  single  nucleotide  polymorphisms,  SNPs)  that  affect  an  individual's 
susceptibility  to  disease.  The  manuscript  has  been  submitted  to  the  peer-reviewed  journal  ANNALS 
OF  STATISTICS. 

The manuscript describes our research into development of methods for combining genetic data
garnered from family-based studies with those collected from unrelated subjects during case-control
comparisons. These approaches represent the primary sampling techniques for studies of gene to
phenotype association. Due to demographic, biological, and random forces, genetic variants differ
in allele frequency in populations around the world, creating population structure or stratification
reflected by ancestry. As a consequence, case-control studies are susceptible to spurious
associations between genetic variants and disease status (Lander et al., 1994). As more data are
collected the challenge of spurious associations due to population structure increases (Devlin and
Roeder, 1999; Devlin et al., 2001; Devlin et al., 2004). The research problem addressed in our recently
submitted publication is how to use both case-control and family-based data in a single test for
association. We seek to develop an approach in which the test is more powerful and robust to
population stratification than competing approaches (Nagelkerke et al., 2004; Epstein et al., 2005).
Our approach, which is robust to differences in sampling distribution across studies, controls Type I
error while attaining good power. The method requires that sufficient genotyping is available on all
samples to permit matching samples based on genetic ancestry.

As a first step we estimated the genetic background of unrelated individuals (cases, controls, and trio
probands). We considered genotypic data from the International HapMap Project (30 CEU trios) and
from the POPRES database (Nelson et al., 2008). Trio probands are matched to one or more controls
that are genetically similar (Figure 1) (Luca et al., 2008). The distance between individuals in this
eigenspace is representative of their genetic differences. When data consist of family-based
samples as trios of parents and their affected offspring, as well as additional controls, we prefer
matching one case to many controls. For trios, pseudo-controls are automatically matched by
ancestry with the corresponding proband, and will be contrasted to the case genotype. Additional
information can be garnered by clustering trio probands with unrelated controls. In this way we
identify additional controls matched by ancestry to the probands (Figure 1). The structure of the
data is equivalent to a matched case-control sample and hence can be analyzed via conditional
logistic regression.
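The matching idea described above can be sketched as follows. This is a minimal illustration on synthetic genotypes, assuming the ancestry axes are estimated from the controls by a plain principal component analysis and that each proband is matched to its k nearest controls by Euclidean distance. It is a simplified stand-in, not the full matching algorithm of the manuscript; the function names and the choice k=5 are hypothetical.

```python
import numpy as np

def ancestry_axes(control_genotypes, n_axes=2):
    """Estimate the leading axes of genetic variation from the controls (plain PCA)."""
    mean = control_genotypes.mean(axis=0)
    _, _, vt = np.linalg.svd(control_genotypes - mean, full_matrices=False)
    return mean, vt[:n_axes].T            # shape: (n_snps, n_axes)

def project(genotypes, mean, axes):
    """Project individuals onto the ancestry axes estimated from the controls."""
    return (genotypes - mean) @ axes

def match_controls(case_coords, control_coords, k=5):
    """Match each case to its k nearest controls in the eigenspace.

    Simplification: a control may be assigned to more than one case, so the
    resulting groups are not guaranteed to be distinct strata.
    """
    matches = {}
    for i, coords in enumerate(case_coords):
        distances = np.linalg.norm(control_coords - coords, axis=1)
        matches[i] = np.argsort(distances)[:k].tolist()
    return matches

# Synthetic allele counts (0, 1, or 2) for 200 controls and 5 probands at 50 SNPs.
rng = np.random.default_rng(0)
controls = rng.integers(0, 3, size=(200, 50)).astype(float)
probands = rng.integers(0, 3, size=(5, 50)).astype(float)

mean, axes = ancestry_axes(controls)
matches = match_controls(project(probands, mean, axes),
                         project(controls, mean, axes), k=5)
print(matches[0])   # indices of the controls matched to the first proband
```

A distance cutoff could be added to `match_controls` to drop individuals with no sufficiently close match.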


Figure 1. HapMap trios matched by ancestry to POPRES controls. The 30 offspring from the HapMap CEU
trios serve as cases and the 2,184 individuals of European ancestry from the POPRES data serve as controls. The
plot displays the top two principal components of ancestry for cases (red) and controls (black). Based on the
distribution of points in the eigenmap, many available controls would not be good matches to the HapMap
trios. Only those delineated in blue are considered further. Each case is matched to one or more controls that
are genetically similar based on the eigenvectors.

Some unrelated controls will not be similar enough to any probands to merit inclusion in the study. For
example, the HapMap trios can only be successfully matched to a subset of the full European
samples in POPRES (Figure 1). Likewise some unrelated cases might not be well matched by any
unrelated controls in the study. Our approach provides features that facilitate the removal of
individuals who cannot be successfully matched because their genetic ancestry is too remote,
relative to the others in the samples (Luca et al., 2008; Lee et al., 2010). These individuals should be
removed from further consideration in the association study. Once the strata are established, a
natural next step is to compare the differences in genotypes between the cases and controls by
using conditional logistic regression (data not shown).
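To make the conditional logistic regression step concrete, the sketch below fits a single-SNP model to a handful of matched strata, each containing one case and several controls. The allele counts are made up for illustration, the function names are hypothetical, and the plain Newton-Raphson fit is an assumption chosen for brevity rather than the software used in the study.

```python
import numpy as np

def stratum_terms(beta, x_case, x_controls):
    """Log-likelihood, score, and information contributed by one matched stratum."""
    x_all = np.concatenate(([x_case], x_controls)).astype(float)
    weights = np.exp(beta * x_all)
    weights /= weights.sum()                  # weights[0] is the case's conditional probability
    mean = float(np.sum(weights * x_all))
    var = float(np.sum(weights * x_all ** 2) - mean ** 2)
    return float(np.log(weights[0])), x_case - mean, var

def log_likelihood(beta, strata):
    return sum(stratum_terms(beta, xc, xs)[0] for xc, xs in strata)

def score(beta, strata):
    return sum(stratum_terms(beta, xc, xs)[1] for xc, xs in strata)

def fit_conditional_logit(strata, n_iter=50, tol=1e-10):
    """Newton-Raphson estimate of the log odds ratio per allele copy."""
    beta = 0.0
    for _ in range(n_iter):
        terms = [stratum_terms(beta, xc, xs) for xc, xs in strata]
        total_score = sum(t[1] for t in terms)
        total_info = sum(t[2] for t in terms)
        if total_info <= 0 or abs(total_score) < tol:
            break
        beta += total_score / total_info
    return beta

# Hypothetical strata: (allele count of the case, allele counts of its matched controls).
strata = [
    (2, [0, 1, 0, 1]),
    (1, [0, 0, 1, 2]),
    (2, [1, 1, 0, 0, 0]),
    (0, [0, 1, 0]),
    (1, [0, 0, 0, 1]),
]
beta_hat = fit_conditional_logit(strata)
print(f"Estimated log odds ratio per allele: {beta_hat:.3f}")
```

Because the likelihood conditions on each stratum, unequal numbers of controls per case are handled without modification.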


Application to Type 1 Diabetes. In previous studies Type 1 Diabetes (T1D) has demonstrated a strong
association with the HLA region of chromosome 6 (Davies et al., 1994). To illustrate our method we
consider joint analysis of 19 T1D trios with just over 2,000 independent controls. All family and control
samples are of European ancestry; for details about the data see Luca et al. (2008). First,
we estimated the ancestry of the controls and plotted them against their two most significant axes of
genetic variation (Figure 2). We then projected the 19 trio probands onto the controls' eigenmap.
The full match algorithm identified 19 distinct strata, each including exactly one trio proband, and
between 19 and 359 controls. We call these unbalanced strata "all controls", to indicate that we
matched the full sample of controls.


Figure 2. Eigenmap for Type 1 diabetes data. Probands (red) are plotted on the eigenmap determined by the
controls (black).


For our analysis we also chose the closest controls to each case. For SNPs in the HLA region, we
evaluated the success at detecting association with T1D. From our results it is apparent that as the
number of matches increases the power to detect certain SNPs also increases (Figure 3). The best
p-value is well over two orders of magnitude better when using all of the controls. The strongest signals
occur at SNPs rs241427 and rs9273363 located near the confirmed T1D susceptibility locus HLA-DQB1
within the HLA class II region (Davies et al., 1994).

Figure 3. Association between HLA markers and Type 1 Diabetes. The -log10(p-values) are plotted versus
individual SNPs in the HLA region of chromosome 6. (A) All controls matched; (B) 1:10 matching; (C) 1:5
matching; (D) Trios only. The strongest association occurs for rs241427 (diamond) and next strongest for
rs9273363 (triangle).
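The scale used in Figure 3 can be illustrated with a short sketch. The two p-values below are hypothetical placeholders chosen only to show how a difference in "orders of magnitude" is read from the -log10 scale; they are not results from the study.

```python
import math

def neg_log10(p_value):
    """Transform a p-value to the -log10 scale."""
    return -math.log10(p_value)

# Hypothetical p-values for one SNP under two designs (not study results).
p_trios_only = 1e-3
p_all_controls = 1e-6

# A difference of d on the -log10 scale means the p-value is 10**d times smaller.
orders_of_magnitude = neg_log10(p_all_controls) - neg_log10(p_trios_only)
print(f"Improvement: {orders_of_magnitude:.1f} orders of magnitude")
```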


Completion  of  Goal  2.  Initiate  the  application  process  for  extramural  funding.

We have identified the Human Frontier Science Program (HFSP) along with the National Institutes of
Health (NIH) as potential funding sources to support our future research into genetic influences
affecting risk of developing diabetes. Our hypothesis is that highly penetrant gene variants can be
identified through their effect on molecular networks. The goal of the new project will be to develop
molecular and statistical tools to identify perturbations of expression in gene networks. Ultimately
these tools will be useful for biologists searching genomes for rare, highly penetrant variants.


The project aims are as follows:

Project Aim 1: a) use gene and pathway ontology, and our experiments, to identify master and minor
gene-expression regulators (leading indicator genes) in a model system (B-lymphocyte cell-lines)
after stimulation and during dynamic growth; b) perturb select genes in co-regulated pathways by
siRNA knockdown; c) measure RNA expression at multiple time points by massively parallel
sequencing; d) build statistical models from resulting data.

Project  Aim  2:  test  statistical  models  by  additional  experiments;  refine  and  "robustify"  models 
considering  these  results. 

Project  Aim  3:  experiments  paralleled  by  statistical  models  to  determine  if  perturbation  of  expression 
can  be  detected  from  pooled  samples,  in  which  only  a  fraction  of  samples  have  perturbed  gene 
expression.  If  such  deconvolution  is  possible,  it  will  reduce  costs  for  experiments  aimed  at  identifying 
rare  variants  affecting  phenotype. 


Figure: Systems biology overview. Relationships between supracellular components (biological systems),
intracellular components, and the function and behavior of these components are revealed by the
interaction of individual components. Panels: Biological systems (pathways, cells & tissues, organs,
organism); Environmental conditions (A, b, c ...); Components (transcripts, proteins, metabolites,
modifications, etc.); Function & Behavior (phenotype, behavior, pathway function, cell & tissue function).
Modified from Baginsky S. et al. Plant Physiol. 2010;152:402-410.


Project Summary. The project will analyze the biology of model cell line strains (i.e., B-lymphocyte
cell lines [BLCL]). Data will be collected on siRNA knockdown of select genes and molecular networks.
The BLCL can be used as surrogate cells for the study of antigen presenting cell function. Molecular
analyses (i.e., immunogenetic and transcriptome data) will be conducted and used to synthesize a
mathematical model of molecular and cellular data for correlations in molecular responses of
BLCL to environmental conditions (i.e., activating stimuli). During the project model cell lines will be
created to identify genes regulating molecular networks. Knockdown strains will be generated to study
the response of BLCL to environmental conditions (e.g., cytokines, phorbol myristate acetate, ionomycin
treatment). Our expertise in the creation and characterization of BLCL will be used to screen the
already existing collection of cell lines (n=300) immortalized from a healthy ancestry matched cohort.
Conditions evaluated will be EBV infection, growth rate, secretion of cytokines, expression of cell
surface markers, and post-translational modifications of MAP kinase proteins.


The project will use next generation sequencing to evaluate changes in RNA abundance, splice
variants, and allele specific expression. The resulting data will be integrated with time series data on
cell phenotypic variation to support development of robust, predictive data models of the network
phenotype associated with BLCL response under defined environmental conditions (e.g., various
mitogen stimulation of BLCL, surrogate antigen presenting cells). The data will be combined with
molecular aspects of the project and mathematical modeling to recognize molecular networks and
network components (e.g., gene expression correlations) causal for network and cell response to
environmental signals. Integration of biological data with bioinformatics and gene/pathway
ontology and mathematical models will be used to develop a comprehensive description of surrogate
antigen presenting cell response to environmental stimuli. During the project we will compare data
garnered from mitogen stimulated BLCL in the presence of specific MAP kinase inhibitors as well as
siRNA knockdowns of network regulatory genes that will allow development of a data model in
which changes to network phenotypes can be identified.

Mathematical work will be an integral part of the project. We will build on the literature of
Graphical Gaussian models and dynamic Bayesian networks to develop and use statistical methods to
estimate co-expression networks for longitudinal data. The models aim to detect conditionally
dependent genes from the experimental data and thus determine the causal relationships within
networks. From these results, we will develop methods for modeling the effects of perturbations
such as knockdowns on the networks. Based on the partial correlation structure before and after
perturbation, we can construct hypothesis tests for which genes are regulating the network. These
results will determine our power to detect perturbations when the signal is less refined. Pooled data,
for example, will yield a muted signal due to the convex combination of network signals. Analysis of
data, both simulated and actual, will determine the feasibility of pooling as a strategy.

Figure: Interacting Networks and Common Variants. Network diagram connecting YES1, CD69,
TNFRSF10B, RGS1, and LY75. Red arrows indicate positive Pearson correlation (average r=0.52).
Source: Nayak et al. (2009) Genome Research 19:1953-1962.

Loci in LD with Common Variants Associated with Autoimmune Disease Phenotypes

Locus   Chromosome   Autoimmune Phenotype   Common Variant   Odds Ratio (95% CI)
RGS1    1q31.2       Celiac, T1D            rs2816316        0.89 (0.84-0.95)
CD69    12p13.31     T1D                    rs4763879        1.09 (1.02-1.16)
IFIH1   2q24.2       SLE, T1D               rs1990760        0.86 (0.82-0.90)

Source: T1DBase (http://t1dbase.org)
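
The partial correlation and pooling ideas in the mathematical plan can be sketched on simulated data. The example below assumes a three-gene chain (gene A influences gene B, which influences gene C) and estimates partial correlations by inverting the sample covariance matrix, which is only workable when samples greatly outnumber genes. The simulated coefficients, gene labels, and function names are hypothetical.

```python
import numpy as np

def partial_correlations(expression):
    """Partial correlation matrix from a samples-by-genes expression array."""
    precision = np.linalg.inv(np.cov(expression, rowvar=False))
    scale = np.sqrt(np.diag(precision))
    pcor = -precision / np.outer(scale, scale)
    np.fill_diagonal(pcor, 1.0)
    return pcor

def pooled_signal(perturbed, unperturbed, fraction):
    """Signal of a pool as a convex combination of perturbed and unperturbed signals."""
    return fraction * perturbed + (1.0 - fraction) * unperturbed

# Simulated chain A -> B -> C: A and C are correlated only through B.
rng = np.random.default_rng(1)
n = 5000
gene_a = rng.normal(size=n)
gene_b = 0.8 * gene_a + rng.normal(size=n)
gene_c = 0.8 * gene_b + rng.normal(size=n)
expression = np.column_stack([gene_a, gene_b, gene_c])

marginal = np.corrcoef(expression, rowvar=False)
partial = partial_correlations(expression)
print(f"A-C marginal correlation: {marginal[0, 2]:.2f}")
print(f"A-C partial correlation given B: {partial[0, 2]:.2f}")

# If 10% of pooled samples carry a perturbation of size 2.0, the pooled shift is muted.
print(f"Pooled shift: {pooled_signal(2.0, 0.0, 0.1):.2f}")
```

In this simulation the A-C partial correlation is close to zero even though the marginal correlation is not, which is the property used to separate direct from indirect relationships in a network.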

Project Outcomes. Identify rare variants affecting phenotypic variability; identify environmentally
sensitive variants. Integrate levels of analysis ranging from mathematics, cellular, molecular, in silico
and in vitro molecular interactions, and pathway modeling to understand regulation of molecular
networks. The approach facilitates bidirectional flow of knowledge: in vitro to in silico (Aim 1), in silico to in
vitro (Aim 2), and cell models enabling simulation of complex network dynamics (Aim 3). The
approach builds on the collective expertise in immunology, genetics, and applied and theoretical
statistics.


Statement of Plans for the Upcoming Research Period

Goal 1. Obtain data from public sources for use in simulations and data model building. Milestone
1A. Access data sets.

Goal 2. Initiate mathematical model building using publicly available datasets. Milestone 2A. Build
models to describe the data. Milestone 2B. Test and refine data models.

In  our  third  quarterly  scientific  progress  report  (03/01/10  -  05/31/10)  we  then  reported  the  following
findings: 

Our  efforts  during  the  recently  completed  research  quarter  were  focused  upon  two  goals:  1)  To  obtain  data 
from  public  sources  for  use  in  simulation  and  data  model  building,  and  2)  To  initiate  mathematical  model 
building  using  publicly  available  datasets.  These  goals  have  been  completed  and  are  described  below. 
Moreover,  our  manuscript  submitted  to  ANNALS  OF  STATISTICS  has  been  accepted  for  publication  and  is  in 
press.  The  publication  describes  our  ongoing  effort  to  exploit  advanced  statistical  methods  for  combining  data 
garnered  from  different  study  designs,  such  as,  family-based  and  case-control  studies.  The  goal  of  this  work 
has  been  to  develop  methodology  for  combining  the  results  of  different  study  designs  into  a  single  test  statistic. 
Application  of  our  method  to  Type  1  Diabetes  identified  increased  association  between  gene  and  disease 
phenotype  by  combining  inheritance  of  HLA-class  II  alleles  within  families  with  that  collected  from  unrelated 
case  and  control  subjects.  The  citation  for  this  publication  is  Crossett  A,  Kent  BP,  Klei  L,  Ringquist  S,  Trucco 
M,  Roeder  K,  Devlin  B.  Using  Ancestry  Matching  to  Combine  Family-Based  and  Unrelated  Samples  for 
Genome-Wide  Association  Studies.  ANNALS  OF  STATISTICS  (in  press). 

Previous  Quarter  Research  Goals 

Goal  1.  Obtain  data  from  public  sources  for  use  in  simulations  and  data  model  building.

Goal  2.  Initiate  mathematical  model  building  using  publicly  available  datasets. 

The  two  goals  have  been  completed.  Briefly,  we  have  identified  appropriate  gene  expression  data  accessible
through  the  National  Institutes  of  Health  (NIH)  sponsored  Gene  Expression  Omnibus  (GEO)  database  (Edgar 
et  al.,  2002).  This  resource  consists  of  primary  data  on  the  abundance  of  mRNA  obtained  during  research 
projects  that  have  been  awarded  funding  through  the  NIH.  A  dataset  created  using  human  liver  cells 
(deposited  in  part  by  our  University  of  Pittsburgh  colleague  Stephen  Strom)  has  been  identified  for  our 
analyses.  These  data  have  been  downloaded  (completing  Goal  1)  and  along  with  phenotype  information 
available  on  the  human  subjects  recruited  for  the  study  are  being  used  to  initiate  mathematical  modeling  of 
gene-gene  interactions  described  in  Goal  2  of  the  recently  completed  research  quarter. 


Detailed  Description  of  Goals  1  and  2 


Completion  of  Goal  1.  Obtain  data  from  public  sources  for  use  in  simulations  and  data  model  building. 

Table 1. Cohort Summary.

Number of Samples         427
Male/Female Ratio         234/193
Mean Age (years)          50
Age Range (years)         0-94
Number with Steatosis     105

Ref: Schadt (2008) PLOS Biology 6:1020-1032.


Milestone 1A. Access datasets. We have chosen to begin our analyses of gene co-expression networks by
evaluating various statistical approaches. To accomplish this we have identified a data set available at the NIH
sponsored GEO database. The collection of data consists of microarray generated gene expression data
collected from liver samples from n=427 human subjects (Table 1). These data are appropriate for our
research into Diabetes and Diabetes Complications in that liver represents an important metabolic organ
responding to insulin. Under physiologic conditions in which insulin secretion and/or insulin signaling are
dysregulated, the liver's contribution to the increase in blood glucose as well as cholesterol and triglycerides is
increased, directly implicating insulin dependent regulation of liver metabolic function in the cardiovascular
complications associated with Diabetes. The cohort available for this study consisted of n=427 subjects of
which approximately 55% were male and 45% female. Notably 25% of the liver samples showed evidence of
mild to severe steatosis, that is, fatty liver disease. The steatosis phenotype will be used in subsequent
analyses to determine whether gene co-expression correlations are sensitive indicators for the presence of the
fatty liver disease phenotype.
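
The percentages quoted above follow directly from the counts in the cohort summary table, as the short check below shows. The variable names are illustrative.

```python
# Counts taken from the cohort summary (Table 1).
n_samples = 427
n_male, n_female = 234, 193
n_steatosis = 105

pct_male = 100 * n_male / n_samples
pct_female = 100 * n_female / n_samples
pct_steatosis = 100 * n_steatosis / n_samples

print(f"Male: {pct_male:.1f}%  Female: {pct_female:.1f}%  Steatosis: {pct_steatosis:.1f}%")
```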


Table 2. Example of Genes Tested.

Array ID       Gene ID   Symbol
10033668539    341       APOC1
10023813203    348       APOE
10023833467    1583      CYP11A1
10033668886    1584      CYP11B1
10023818702    4547      MTTP
10025910281    5105      PCK1
10033668775    10891     PPARGC1A

As an example of the data collected we obtained data for the complete cohort of n=427 subjects measuring
mRNA abundance on roughly 40,000 transcripts from each sample. This corresponds to about 17 million
(=427x40,000) individual data points. Using customized computer scripts we have organized the data to evaluate the
interaction between liver specific target genes and their corresponding transcription factors. As shown in Table
2 select genes were examined based upon their known importance to liver metabolic function. For example,
PCK1 encodes the enzyme Phosphoenolpyruvate Carboxykinase 1 (soluble). PCK1 gene expression is
regulated by insulin. The enzyme is produced when insulin levels are low (occurring during fasting) and transcription of
this gene is rapidly turned off when insulin levels are high (during feeding). The gene product is a main control
point for regulation of gluconeogenesis. The gene product functions by catalyzing the conversion of
oxaloacetate to phosphoenolpyruvate. Other genes examined by our studies were chosen based upon similar
criteria of their importance to insulin regulated liver metabolism and metabolic disease.

Table  3.  Example  of  Relative  mRNA  Abundance. 


Array ID       GSM242213   GSM242214   GSM242215
10033668539    -0.3934     -0.3791     -0.1857
10023813203     0.2432      0.1975      0.5117
10023833467    -0.1083     -0.143      -0.0699
10033668886    -0.0322      0.0246     -0.0189
10023818702    -0.192      -0.1284      0.0556
10025910281     0.3219     -0.1297     -0.3554
10033668775     0.154      -0.2382      0.1307

In Table 3 we show an example of the mRNA expression data that have been collected. The table covers the
genes listed in the preceding Table 2 (see the column labeled Array ID) and lists the log2-normalized
expression signal from the mRNA microarray. For the data in Table 3, values of zero indicate no
change relative to the mean mRNA abundance, while positive values indicate increased mRNA expression and
negative values represent decreased gene expression. The column headers beginning with GSM refer to each
sample that was used. There is a unique header for each of the n=427 liver samples, providing a data matrix
of 40,000-by-427 for co-expression analysis. Collection of these data, along with the phenotype information
summarized in Table 1, represents completion of Milestone 1A of the previous research quarter.
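
The interpretation of the values in Table 3 can be illustrated with a minimal Python sketch. It assumes that each probe's log2 signal is centered on that probe's mean across samples, so that zero means no change from the mean; the report does not describe the normalization pipeline actually used, and the helper name `center_log2` and the raw intensities below are hypothetical.

```python
import math

def center_log2(raw_by_probe):
    """Return log2 values centered on each probe's mean across samples.

    raw_by_probe maps a probe ID to a list of positive raw intensities,
    one per sample (hypothetical input, not the report's actual pipeline).
    """
    centered = {}
    for probe, raw in raw_by_probe.items():
        logs = [math.log2(v) for v in raw]
        mean = sum(logs) / len(logs)
        centered[probe] = [v - mean for v in logs]
    return centered

# Made-up intensities for one probe across three samples.
example = center_log2({"probe_1": [100.0, 200.0, 400.0]})
# The middle sample sits at the log2 mean (value near 0); the first is one
# log2 unit (2-fold) below the mean and the last is one unit above it.
print(example["probe_1"])

# Size of the full matrix described in the text: probes x samples.
n_values = 40_000 * 427
print(n_values)
```

Applied to the full data set, the same centering would be run once per probe over all 427 samples.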

Completion  of  Goal  2.  Initiate  mathematical  model  building  using  publicly  available  datasets. 

Milestone 2A. Build models to describe the data. In order to build data models to account for the
co-expression correlations between genes we chose to focus on liver-specific Transcription Factors that have
been shown to regulate genes critical for liver metabolic function, such as those listed above in Table 2. Listed
in Table 4 are the Transcription Factors that were evaluated. For each gene, the table includes the official
gene symbol, gene identifying number, and full name. For example, PPARGC1A encodes the co-transcription
factor Peroxisome Proliferator-Activated Receptor Gamma, Coactivator 1 Alpha. The gene product has been
previously identified as a critical transcriptional coactivator regulating genes involved in energy metabolism. It
is known to interact with other Transcription Factors, such as cAMP Response Element Binding Protein,
Forkhead Box O1, and Peroxisome Proliferator-Activated Receptor Gamma. It provides a direct link between
external physiological stimuli and the regulation of mitochondrial biogenesis, is a major factor regulating
cellular cholesterol homeostasis, and has been implicated in the development of obesity.

Table  4.  Transcription  Factors  that  Regulate  Liver  Metabolic  Functions. 


Symbol     Gene ID  Official Full Name
CEBPA      1050     CCAAT/Enhancer Binding Protein (C/EBP), Alpha
CEBPB      1051     CCAAT/Enhancer Binding Protein (C/EBP), Beta
CEBPD      1052     CCAAT/Enhancer Binding Protein (C/EBP), Delta
CREB1      1385     cAMP Responsive Element Binding Protein 1
CRTC2      200186   CREB Regulated Transcription Coactivator 2
FOXA1      3169     Forkhead Box A1
FOXA2      3170     Forkhead Box A2
FOXA3      3171     Forkhead Box A3
FOXO1A     2308     Forkhead Box O1
HNF1A      6927     HNF1 Homeobox A
HNF4A      3172     Hepatocyte Nuclear Factor 4, Alpha
MLXIPL     51085    MLX Interacting Protein-Like
NR1H2      7376     Nuclear Receptor Subfamily 1, Group H, Member 2
NR1H3      10062    Nuclear Receptor Subfamily 1, Group H, Member 3
NR1H4      9971     Nuclear Receptor Subfamily 1, Group H, Member 4
PPARA      5465     Peroxisome Proliferator-Activated Receptor Alpha
PPARD      5467     Peroxisome Proliferator-Activated Receptor Delta
PPARG      5468     Peroxisome Proliferator-Activated Receptor Gamma
PPARGC1A   10891    Peroxisome Proliferator-Activated Receptor Gamma, Coactivator 1 Alpha
RXRA       6256     Retinoid X Receptor, Alpha
RXRG       6258     Retinoid X Receptor, Gamma
SREBF1     6720     Sterol Regulatory Element Binding Transcription Factor 1
SREBF2     6721     Sterol Regulatory Element Binding Transcription Factor 2

Table  5.  Selected  Transcription  Factor  and  Target  Gene  Pairs. 


Transcription Factor   Gene ID   Target Gene   Gene ID
PPARGC1A               10891     CPT1A         1374
PPARGC1A               10891     CYP7A1        1581
PPARGC1A               10891     ESRRA         2101
PPARGC1A               10891     G6PC          2538
PPARGC1A               10891     LDLR          3949
PPARGC1A               10891     PCK1          5105
PPARGC1A               10891     MFN2          9927

The Target Genes and Transcription Factors summarized in Tables 2 and 4 can also be described in terms of
previously identified causal networks. These are summarized in Table 5. For example, the Transcription
Factor PPARGC1A has been implicated in regulating the expression of Target Genes such as PCK1
(Phosphoenolpyruvate Carboxykinase 1, soluble) and G6PC (Glucose-6-Phosphatase, Catalytic Subunit). The
gene product of G6PC is an integral membrane protein of the endoplasmic reticulum that catalyzes hydrolysis
of D-glucose 6-phosphate to D-glucose and orthophosphate. It is a key enzyme in glucose homeostasis,
functioning in gluconeogenesis and glycogenolysis, processes tightly controlled by insulin secretion and
signaling.

Milestone 2B. Test and refine data models. The gene expression data collected from n=427 liver samples
have been queried for co-expression correlations based upon the previously known Transcription Factor and
Target Gene interactions summarized in Table 5. As illustrated in Figure 1, we have observed strong
correlation between the transcription co-activator PPARGC1A and a main enzymatic control point for the
regulation of gluconeogenesis, the gene PCK1. The Pearson's correlation between these two genes is 0.75 and
accounts for roughly 56% of the variance observed in the data (Figure 1). Other gene-gene transcription
correlations have been examined. For example, PPARGC1A and G6PC exhibit a positive correlation
(Pearson's correlation of 0.68), and the insulin-responsive transcription factor FOXO1A and
PPARGC1A exhibit a Pearson's correlation of 0.72 (data not shown). The high level of correlation between
these gene pairs (i.e., FOXO1A-PPARGC1A, PPARGC1A-PCK1, and PPARGC1A-G6PC) is consistent with
known insulin-regulated transcriptional control of hepatic gluconeogenesis.

Figure 1. Gene Co-Expression of PPARGC1A and PCK1 (Pearson's Correlation=0.75, n=427)
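
The link between a Pearson's correlation and the share of variance it accounts for can be sketched in a few lines of Python. This is only an illustration: the expression values are made up, the function names are hypothetical, and the report does not describe the scripts actually used for this calculation.

```python
import math

def pearson(x, y):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(x)
    if n != len(y) or n < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mx = sum(x) / n
    my = sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

def variance_explained(r):
    """Fraction of variance shared by two linearly related variables (r squared)."""
    return r * r

# Made-up expression values for a transcription factor and a target gene.
tf = [0.10, -0.20, 0.35, 0.05, -0.30]
target = [0.15, -0.10, 0.30, -0.05, -0.25]
r = pearson(tf, target)
print(f"r = {r:.2f}, r^2 = {variance_explained(r):.2f}")

# A correlation of 0.75 corresponds to r^2 = 0.5625, i.e. roughly 56%.
print(variance_explained(0.75))
```

For the full data set the two input sequences would each hold 427 values, one per liver sample.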


Statement  of  Plans  for  the  Upcoming  Research  Period 

Goal  1.  Continue  to  develop  mathematical  models  to  measure  gene  expression  and  the  correlation  between 
expressions  of  paired  genes. 

Milestone  1A.  Complete  analysis  of  transcription  co-expression  observed  between  transcription  factors  and 
their  target  genes. 

Milestone 1B. Discover the presence of new gene-gene co-expression pairs using transcription factors and
the complete set of gene expression data.

Goal  2.  Initiate  the  writing  of  a  scientific  article  for  publication  in  a  peer  reviewed  scientific  journal. 

Milestone  2A.  Outline  a  manuscript  describing  the  results  of  our  research  into  recognizing  co-expression 
correlations  between  genes. 

In the fourth and final quarterly scientific progress report (06/01/10 - 08/26/10) of year 03, we now report on our
cumulative results.

Our  work  performed  during  the  recently  completed  research  quarter  focused  on  the  goals  of  developing 
mathematical  models  to  measure  gene  expression  and  co-expression  of  correlated  genes,  and  to  begin  the 
process  of  writing  the  results  for  publication  in  a  peer  reviewed  scientific  journal.  These  goals  were  designed 
to  allow  our  research  group  to  establish  sufficient  expertise  in  the  disciplines  required  to  enable  the 
investigation  of  gene  co-expression  networks.  We  have  made  substantial  progress  toward  achieving  both 
goals  as  detailed  below. 

Previous  Quarter  Research  Goals 

Goal  1.  Continue  to  develop  mathematical  models  to  measure  gene  expression  and  the  correlation  between 
the  expressions  of  paired  genes. 

Goal  2.  Initiate  the  writing  of  a  scientific  article  for  publication  in  a  peer  reviewed  scientific  journal. 

Detailed  Description  of  Goals  1  and  2 

Completion  of  Goal  1.  Continue  to  develop  mathematical  models  to  measure  gene  expression  and  the 
correlation  between  the  expressions  of  paired  genes. 

Milestone 1A. Complete analysis of transcription co-expression observed between transcription factors and
their target genes. We have identified 3 datasets containing gene expression data collected from biological
materials from human subjects. Two of the datasets use B-lymphoblastoid cell lines (BLCL) and the third
measured mRNA expression in human liver samples (summarized in Table 1). The combined datasets
comprise n=791 samples collected from human subjects. These data are available to our research effort,
have been downloaded, and are currently being evaluated for gene co-expression events. One goal of the
project is to use real biological and experimental data to develop advanced mathematical models for
recognizing gene-to-gene correlations.

Table 1. Available Genome Wide Expression Data Collected from Human Subjects

Cell Type   Number of Samples   Genome-Wide Expression   Phenotype                              Source       Reference
BLCL        294                 Yes                      None                                   HapMap       Nayak et al. (2009)
BLCL        70                  Yes                      Type 1 Diabetes                        B.O. Boehm   Personal Communication
Liver       427                 Yes                      Fatty Liver Disease, Type 2 Diabetes   Merck        Schadt et al. (2008); Yang et al. (2010)

The  dataset  available  from  human  liver  samples  collected  from  n=427  subjects  is  being  studied.  These  data 
are  accompanied  by  demographic  data  including  subject  age,  gender,  race,  and  body  mass  index  (BMI)  (Table 
2).  We  also  have  data  collected  on  liver  disease,  including  fatty  liver  (i.e.,  steatosis)  as  well  as  exposure  to 
liver  toxins  (data  not  shown).  An  early  goal  that  will  be  pursued  in  the  upcoming  research  period  will  be  to 
increase  the  dimensionality  of  our  analyses  by  incorporation  of  subject  health  and  demographic  data  into  our 
modeling  of  co-expression  networks.  The  goal  of  the  project  is  to  identify  gene  neighborhoods  that  correspond 
with  human  health  and  disease,  e.g.,  steatosis. 

Table  2.  Demographics  of  Selected  Human  Liver  Donors 


Donor ID   Age   Sex   Race   BMI (kg/m^2)

Steatosis: Mild
2220015    35    F     W      26.2
2220017    53    M     W      16.3
2220018    22    M     H      22.0
2220021    20    M     W      23.5

Steatosis: Moderate
2220109    37    M     W      22.4
2220137    61    F     W      33.1
2220138    49    F     W      39.0
2220151    57    M     W      27.7

Steatosis: Severe
2220074    13    M     W      32.2
2220187    18    M     W      24.9
2220025    28    M     W      26.6
2220191    30    F     W      40.2

Milestone 1B. Discover the presence of new gene-gene co-expression pairs using transcription factors and
the complete set of gene expression data. In addition to the samples collected from human liver donors we
have comparable datasets on gene expression in BLCL. By combining these data we intend to examine gene
networks from n=364 BLCL samples. The samples available from our collaboration with Bernhard Boehm
(University of Ulm, Germany) include gene expression data collected from n=3 subjects with Type 1 Diabetes
(T1D) and n=4 control subjects. The BLCL were treated with 5 growth conditions (media alone, PMA, LPS-low,
LPS-high, and IL-1b) in order to examine stimulation of NFkB transcription factor dependent gene expression
during gene network analysis. Table 3 summarizes the experimental design used to stimulate the NFkB
dependent gene expression pathway of T1D and control samples. The subject identifying number and
phenotype are listed in columns 1 and 2. Columns 3 through 7 list the dataset identifying number for the
control growth condition (column 3) as well as the stimulating conditions (columns 4 through 7). For each of
the 70 conditions that were tested (7 subjects x 5 growth conditions x 2 incubation times) there is a
corresponding dataset of roughly 40,000 data points from which the level of a corresponding mRNA transcript
can be determined. It is the correlation network existing between these data points that will be the subject of
the upcoming work period.
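
The count of 70 datasets follows from the layout of Table 3, and can be checked with a short Python sketch. The subject identifiers and diagnoses are taken from Table 3; the condition labels and the variable names are illustrative shorthand rather than identifiers used by the project.

```python
from itertools import product

# Subjects and diagnoses as listed in Table 3.
subjects = {
    "169B": "Control", "BOB-5013": "Control",
    "ET-2036O": "Control", "ET-2036w": "Control",
    "BOB-5014": "T1D", "ET-2037O": "T1D", "ET-2037w": "T1D",
}
# One unstimulated control condition plus four stimulating conditions.
conditions = ["Media", "PMA", "LPS-low", "LPS-high", "IL-1b"]
incubation_hours = [8, 24]

# Every subject is measured under every condition at every time point.
design = [
    (subject, diagnosis, condition, hours)
    for (subject, diagnosis), condition, hours
    in product(subjects.items(), conditions, incubation_hours)
]

print(len(design))                                  # 7 x 5 x 2 = 70
print(sum(1 for row in design if row[1] == "T1D"))  # 3 x 5 x 2 = 30
```

Each tuple in `design` corresponds to one array in Table 3, so a per-array lookup table could be keyed on the same four fields.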


Table  3.  Genome  Wide  Expression  Data  Collected  on  T1D  and  Control  BLCLs 


Subject ID   Diagnosis   Media         PMA (30 ng/ml)   LPS (20 ng/ml)   LPS (100 ng/ml)   IL-1b (1 ng/ml)

Incubation Time 8 Hours
169B         Control     5299187028H   5299187028J      5299187026B      5299187012B       5299187012I
BOB-5013     Control     5299187028I   5299187028E      5299187030G      5299187030D       5299187029D
ET-2036O     Control     5299187028L   5299187028K      5299187028G      5299187028C       5299187028F
ET-2036w     Control     5299187026F   5299187012C      5299187030E      5299187029H       5299187012D
BOB-5014     T1D         5299187028B   5299187028A      5299187029L      5299187029A       5299187012H
ET-2037O     T1D         5299187029C   5299187026A      5299187026E      5299187030L       5299187012E
ET-2037w     T1D         5299187030A   5299187030B      5299187012A      5299187029I       5299187012G

Incubation Time 24 Hours
169B         Control     5299187030C   5299187030F      5299187026H      5299187030J       5299187029K
BOB-5013     Control     5299187029J   5299187029E      5299187012L      5299187012F       5299187026J
ET-2036O     Control     5299187027J   5299187027H      5299187027G      5299187027I       5299187029B
ET-2036w     Control     5299187030I   5299187012K      5299187026L      5299187030H       5299187030K
BOB-5014     T1D         5299187026C   5299187026K      5299187012J      5299187026I       5299187026D
ET-2037O     T1D         5299187027C   5299187027E      5299187027L      5299187027F       5299187027A
ET-2037w     T1D         5299187027K   5299187028D      5299187027D      5299187027B       5299187029G

We have been able to identify evidence for as many as n=419 known and predicted NFkB target genes. A
subset of known NFkB dependent genes is listed in Table 4. The signals obtained from probes designed to
measure the expression of these genes will be used during our initial analyses of the BLCL derived data that
were collected and are summarized in the preceding Table 3. The hypothesis being tested is that the stimuli
used in the experiment (i.e., the NFkB stimulating compounds PMA, LPS, and IL-1b) will show evidence of
differentially affecting gene expression of a subset of the mRNA transcripts listed in Table 4, and that the
pattern of co-expression will differ between T1D and control subjects.

Table 4. Selected NFkB Target Genes

Official Symbol   Gene ID   Official Full Name

Transcription Control
AHCTF1            25909     AT hook containing transcription factor 1
CEBPD             1052      CCAAT/enhancer binding protein (C/EBP), delta
CREB3             10488     cAMP responsive element binding protein 3
E2F3              1871      E2F transcription factor 3
LEF1              51176     lymphoid enhancer-binding factor 1
SP7               121340    Sp7 transcription factor
STAT5A            6776      signal transducer and activator of transcription 5A
TFEC              22797     transcription factor EC
YY1               7528      YY1 transcription factor

Chemokines and Chemokine Receptors
CCL5              6352      chemokine (C-C motif) ligand 5
CCR5              1234      chemokine (C-C motif) receptor 5
CCR7              1236      chemokine (C-C motif) receptor 7

Cell Division/Cell Cycle Control
CCND1             595       cyclin D1
CCND2             894       cyclin D2

Interleukin Cytokines
IL1B              3553      interleukin 1, beta
IL2               3558      interleukin 2
IL8               3576      interleukin 8

Completion  of  Goal  2.  Initiate  the  writing  of  a  scientific  article  for  publication  in  a  peer  reviewed  scientific 
journal. 

Milestone 2A. Outline a manuscript describing the results of our research into recognizing co-expression
correlations between genes. The research being carried out as a result of this DOD funding mechanism has
been greatly enhanced by our collaborations. In particular, the study of gene co-expression networks is being
pursued in close collaboration with Kathryn Roeder's research team at Carnegie Mellon University. As a result
of our work together her group has published a paper describing methods they have developed for working
with high dimensional datasets (Liu et al., 2010). In that manuscript they present a new method for detecting
gene co-expression correlations. The methodology recovers true interactions with high probability and is
independent of sample size and dimensionality of the dataset (Figure 1). Prior to development of the current
method, standard techniques used the Bayesian information criterion (BIC). Unfortunately, BIC can perform
poorly when the dimensionality of a dataset is large relative to sample size, as commonly occurs when working
with gene expression data.

In contrast, the new approach (denoted by the acronym StARS) chooses the network regularization parameter
so that the resulting network is sparse without excessive variability across sub-samples of the data. This is
accomplished by incrementally reducing the level of regularization while monitoring variability between the
networks estimated from the sub-samples. Regularization is reduced only so long as the variability between
the resulting networks remains small. In applying the method to problems like gene regulatory networks the
aim is to investigate the interaction of many genes (see Figure 1 for an illustration). We have chosen to
tolerate a few false positive interactions so long as false negatives are minimal (note that StARS results in
many fewer false edges than BIC). In contrast to our method, the BIC approach frequently recognizes a very
large number of false positives, creating a difficult situation when designing biological experiments to test the
conclusions predicted by the network model.

Figure 1. Comparison of the StARS method developed during the DOD funded project with BIC. Data used assumed a
sample size of n=400 and dimensionality of the data equal to 100. Note the high degree of similarity between edges
defined by the True graph and the StARS approach compared with the excessive number of edges identified using the
BIC method.
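
The stability-based selection described above can be sketched in Python under several loudly stated assumptions. The graph estimator here is a simple thresholded-correlation stand-in, not the regularized estimator used by Liu et al. (2010); the instability threshold `beta=0.05`, the sub-sample size of 10*sqrt(n), and all function names are assumptions made for illustration; and the data are synthetic.

```python
import math
import random
from itertools import combinations

def pearson(x, y):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

def estimate_edges(rows, lam):
    """Stand-in graph estimator: keep edge (i, j) if |correlation| > lam."""
    p = len(rows[0])
    cols = [[r[j] for r in rows] for j in range(p)]
    return {(i, j) for i, j in combinations(range(p), 2)
            if abs(pearson(cols[i], cols[j])) > lam}

def stars_select(rows, lams, n_sub=20, beta=0.05, seed=0):
    """Return the least regularization whose edge instability stays <= beta."""
    rng = random.Random(seed)
    n, p = len(rows), len(rows[0])
    b = min(n - 1, int(10 * math.sqrt(n)))  # assumed sub-sample size
    subsamples = [rng.sample(rows, b) for _ in range(n_sub)]
    n_pairs = p * (p - 1) // 2
    chosen, running_max = None, 0.0
    for lam in sorted(lams, reverse=True):  # strongest regularization first
        counts = {}
        for sub in subsamples:
            for edge in estimate_edges(sub, lam):
                counts[edge] = counts.get(edge, 0) + 1
        # Edge instability is 2*theta*(1-theta), theta = selection frequency.
        total = sum(2 * (c / n_sub) * (1 - c / n_sub) for c in counts.values())
        running_max = max(running_max, total / n_pairs)  # monotonized
        if running_max > beta:
            break
        chosen = lam
    return chosen

# Synthetic data: variables 0 and 1 are truly correlated, 2-4 are noise.
gen = random.Random(1)
rows = []
for _ in range(400):
    a = gen.gauss(0, 1)
    rows.append([a, a + gen.gauss(0, 0.5)] + [gen.gauss(0, 1) for _ in range(3)])

lams = [k / 10 for k in range(1, 9)]
lam = stars_select(rows, lams)
print(lam, sorted(estimate_edges(rows, lam)))
```

Because the threshold is lowered only while the sub-sampled graphs agree with one another, the procedure stops before noise edges begin to appear erratically, which is the behavior the text contrasts with BIC.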


Statement  of  Plans  for  the  Upcoming  6-Month  Research  Period 

Goal  1.  Using  the  data  available  on  human  BLCL  and  liver  samples  continue  to  develop  mathematical  models 
for  identifying  gene  co-expression  networks. 

Milestone  1A.  Analyze  gene  expression  data  collected  from  human  samples  for  the  presence  of  gene 
networks. 

Milestone 1B. Incorporate human phenotype and demographic data into the analysis of co-expression
networks.

Goal 2. Incorporate gene expression and phenotype data from animal models into the analysis of gene
co-expression networks.

Milestone  2A.  Identify  gene  expression  datasets  collected  from  mouse  models  of  human  disease,  including 
diabetes. 

KEY RESEARCH ACCOMPLISHMENTS:


1.  Creation  of  a  data  repository  containing  gene  expression  data  from  greater  than  n=700  human  subjects  and 
40,000  genes  for  each  subject. 

2.  Development  of  mathematical  models  for  recognizing  and  evaluating  the  correlations  between  co-expressed 
genes. 

3.  Identification  of  gene  networks  and  network  neighborhoods  correlating  with  human  disease  phenotypes. 

4.  Publication  of  6  manuscripts. 

REPORTABLE  OUTCOMES: 

Manuscripts  (6  publications) 

1.  Lu,  L.,  Boehm,  J.,  Nichol,  L.,  Trucco,  M.,  and  Ringquist,  S.  Multiplex  HLA  typing  by  pyrosequencing.  In 
Methods  in  Molecular  Biology,  vol  496:  DNA  and  RNA  Profiling  in  Human  Blood,  ed.  P.  Bugert.  Humana 
Press  Inc.,  Totowa,  New  Jersey  (2009). 

2.  Kim,  D.H.,  Ringquist,  S.,  and  Dong,  H.H.  Fructose  -  A  sweet  risk  of  fatty  liver  disease.  In:  Chocolate, 
Fast  Foods  and  Sweeteners:  Consumption  and  Health,  ed.  M.R.  Bishop.  Nova  Publishers  Inc.,  (2010). 

3.  Wu,  J.,  Devlin,  B.,  Ringquist,  S.,  Trucco,  M.,  and  Roeder,  K.  Screen  and  clean:  a  tool  for  identifying 
interactions  in  genome-wide  association  studies.  Genetic  Epidemiology  34,  275-285  (2010). 

4. Kim, D.H., Zhang, T., Ringquist, S., and Dong, H.H. Targeting FoxO1 for hypertriglyceridemia. Current
Drug Targets (in press).

5. Kamagate, A., Kim, D.H., Zhang, T., Slusher S., Strom, S.C., Bertera, S., Ringquist, S., and Dong, H.H.
FoxO1 links hepatic insulin action to endoplasmic reticulum stress. Endocrinology 151, 3521-3535 (2010).

6.  Crossett,  A.,  Kent,  B.P.,  Klei,  L.,  Ringquist,  S.,  Trucco,  M.,  Roeder,  K.,  and  Devlin,  B.  Using  ancestry 
matching  to  combine  family-based  and  unrelated  samples  for  genome-wide  association  studies.  Annals  of 
Statistics  (in  press). 

Development  of  Cell  Lines,  Tissue  or  Serum  Repositories 

1. Repository of DNA samples collected from more than 1,800 T1D and T1DN patients.

CONCLUSION: 

The conclusions from the current year of funding are that mathematical models can be designed to recognize
gene co-expression correlations, as well as gene networks and highly interconnected sub-networks, with a
high probability of identifying true correlations. Correlated gene pairs and neighborhood
groups will, in turn, provide evidence for their role in human disease phenotypes. The work planned in the
upcoming year will test the hypothesis that study of co-expression networks in cells isolated from diabetic
patients, when compared with cells collected from healthy subjects, will identify disease-specific groups of
genes and reveal molecular genetic pathways leading to disease susceptibility.

The  research  project  generated  6  publications.  These  are  listed  under  the  section  entitled  "REPORTABLE 
OUTCOMES". 

The So What Section. What are the implications of this research? Diabetes affects 16 million Americans, with
800,000 new cases diagnosed annually. African, Hispanic, Native and Asian Americans are particularly
susceptible to its most severe complications. Costs associated with diabetes may be as high as $132 billion.
Diabetes accounts for 42% of new cases of end-stage renal disease, with over 100,000 new cases per year at
an average cost of $55,000 per patient annually.

What are the military significance and public purpose of this research? As the military is a reflection of the U.S.
population, improved prediction of risk for developing diabetes and diabetic complications among active duty
members of the military, their families, and retired military personnel will potentially allow focused preventative
treatment of at-risk individuals, providing significant healthcare savings and improved patient well-being.